Abstract

In this paper we propose to use the Winner Takes All
hashing technique to speed up forward propagation and
backward propagation in fully connected layers in
convolutional neural networks. The proposed technique
significantly reduces the computational complexity, which
in turn allows us to train layers with a large number of
kernels without the associated time penalty.

As a consequence we are able to train convolutional
neural networks on a very large number of output classes
with only a small increase in the computational cost. To
show the effectiveness of the technique we train a new
output layer on a pretrained network using both the regular
multiplicative approach and our proposed hashing
methodology. Our results show no drop in performance and
demonstrate, with our implementation, a 7-fold speed-up
during training.

1. Introduction 

Convolutional Neural Networks (CNNs) have recently
achieved state of the art performance in a number of
computer vision tasks [5, 8]. Even more remarkably, it has
been shown that using the output of the 7th layer of the
network (FC7), when trained on the ImageNet benchmark [3],
as a generic feature descriptor makes it possible to achieve
higher performance than traditional handcrafted features
from the computer vision literature [8].

One of the main challenges when working with large
networks is the computational cost during training. In fact,
many different efforts have emerged to increase the speed of
these networks. These include using special commands on
CPUs [10], using GPUs (as in frameworks such as Torch,
Caffe and Theano), and using FPGAs [4, 7]. Although
these techniques have had great success in reducing training
and testing time for deep neural nets, the amount of
computation required has remained constant. This is because
they have focused on executing these operations faster
and/or in parallel, instead of reducing the number of
needed computations.

Figure 1. Distribution of computational cost on AlexNet [5] as we
vary the number of output classes from (a) 1,000, (b) 10,000 and
(c) 100,000. In red we show the computational cost with our
proposed method.

Figure 1(a) shows the distribution of the computational
cost of the deep neural network from [5], which is the
basis of many current approaches in computer vision.
Although the computational cost of the classification layer
in these approaches is small compared to the rest of the
network, this can change easily as the number of classes
increases. The reason for this is that the number of kernels
in a classification layer scales linearly with the number of
classes.

For example, as we increase the number of classes from
1,000 (figure 1(a)) to 10,000 (figure 1(b)) or 100,000
(figure 1(c)), the cost of the output layer quickly dominates
the computational cost of the network. Furthermore, this
cost would become even more pronounced if the output size
of layer 7 is increased, since the computational complexity
of a layer is linear in both the input size and the output
size.

In this paper we introduce a hashing-based technique
that dramatically reduces the computational cost of training
and testing of sparse fully connected layers in
convolutional networks. The red bars in figures 1(b) and
1(c) show the computational cost of computing the output
when using our approach. This improvement allows using the
same network architecture on a very large number of classes
without significantly increasing the computational cost.

To this end we make use of two main observations: (1) the
most computationally expensive operation for these fully
connected layers is matrix multiplication, which occurs both
during forward propagation and backpropagation, and (2)
when propagating through layers whose output is trained
for classification, units that have low activation perform
computations that will not be used. This is because we are
only interested in units that are "turned on" and which
will lead to changes in the output. The same is true for
backpropagation: units with extremely low values will
have a very low slope and can be ignored or aggregated. This
reasoning holds true for any layer with a non-linearity
function that applies a cut-off on the output values, but
the extra computation is more pronounced in layers that are
trained to be sparse.

Although several more efficient approaches to matrix
multiplication have been proposed [6], these have not yet
been applied to neural networks and, in most cases, current
implementations only lead to a modest saving in
computational complexity relative to the $O(n^3)$ of the
standard algorithm. As a consequence, the current
implementations of neural networks have a computational
cost that grows linearly with the number of units. One place
where this cost is very tangible is the final classification
layer, where the number of units corresponds to the number
of expected classes.

Here we propose to use winner take all (WTA) hashing to
identify the units that will have sufficiently high amplitude
before performing any expensive computation. Then, only
for the identified units, we compute the exact output,
while using a default low value for the rest of the units.
This way only a small number of elements in the layer
output matrix need to be computed while the rest are set to
a predefined value. This WTA technique has already been
used in vision tasks. In particular, it was successfully
applied on a HOG-based object detector [1] in order
to detect 100,000 object classes.

Furthermore, during the backpropagation phase, we only
backpropagate through units that were activated. This is
similar to using dropout. We also backpropagate through
units that were not activated but should have been.

The reason for backpropagating only through activated
units is to minimize the computational cost. The expected
output, after the non-linearity layer, is sparse. Also, the
gradient of the loss will have the same sparsity pattern.
Therefore, most of the errors fed back on the non-activated
nodes would actually be near zero or exactly zero,
depending on the type of non-linearity used.

To underline the significance of this approach, we apply
this technique to training neural networks to classify
into 10k classes and demonstrate a 7-fold speed-up in
training the layer.

2. Winner takes all hashing 

Winner Take All (WTA) hashing is a method that
transforms the input feature space into binary codes. In the
resulting space, Hamming distance closely correlates with
the rank similarity measure [11]. The obtained binary
descriptors show a degree of invariance to slight
perturbations of the original data, which makes the method
a suitable basis for retrieval algorithms. WTA hashing is
based on random permutations of the data components and
does not require any data-driven learning.

Figure 2 summarizes the computation of one WTA hash.
First we permute the input vector $[x_1, x_2, \ldots, x_K]$ by
indexing the incoming data through the permutation arrays.
A different permutation is used for each section $s$ and for
each hash $i$. In each section the first $N_e$ elements of the
permuted vector are selected: $[x'_1, x'_2, \ldots, x'_{N_e}]$.
These are compared and the index of the largest element is
recorded, $h_s$. We compute $N_s$ sections for each hash. These
sections are all concatenated to form one single hash $h_i$.

For instance, if we have a vector $x =
[0.2, 0.9, 0.4, 0.5, 0.1]$ and use a two-band hash with
permutations $P_1 = [3, 2, 4, 5, 1]$ and $P_2 = [4, 1, 5, 3, 2]$,
and $N_e = 4$, the first indices of each permutation are
$[3, 2, 4, 5]$ and $[4, 1, 5, 3]$ respectively. The selected
elements are $[0.4, 0.9, 0.5, 0.1]$ and $[0.5, 0.2, 0.1, 0.4]$.
The maximum values are 0.9 and 0.5, and their respective
indices within the selected elements are 2 and 1. The
resulting concatenated band is [0110] (least significant
bit leftmost).
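
The worked example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names `wta_hash` and `encode_lsb_first` are our own and not part of any released package, the permutations are truncated to the four leading indices that the example actually uses, and two bits per section are assumed because the example encodes the indices 2 and 1 in two bits each.

```python
def wta_hash(x, permutations, window):
    """Return, for each permutation, the 1-based position of the largest
    value among the first `window` permuted elements."""
    winners = []
    for perm in permutations:
        # The permutations hold 1-based indices into x, as in the text.
        selected = [x[p - 1] for p in perm[:window]]
        winners.append(selected.index(max(selected)) + 1)
    return winners


def encode_lsb_first(winners, bits_per_section):
    """Concatenate the winners as bit strings, least significant bit leftmost."""
    return "".join(
        "".join(str((w >> b) & 1) for b in range(bits_per_section))
        for w in winners
    )


x = [0.2, 0.9, 0.4, 0.5, 0.1]
# Leading four indices of the two permutations from the example.
perms = [[3, 2, 4, 5], [4, 1, 5, 3]]
winners = wta_hash(x, perms, window=4)
code = encode_lsb_first(winners, bits_per_section=2)
print(winners, code)
```

Because only the relative order of the selected values decides the winner, small perturbations of `x` that do not change which element is largest leave the code unchanged.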

3. Speeding up Neural Network computations with WTA hashing

Neural network layers usually have sparse activation
patterns. The output values of the units in these layers are
mostly a constant value, usually zero, while a few units
have output values that are larger. These patterns are
obtained either explicitly, using sparse regularization
terms, or implicitly, because of the data that the layer is
trying to learn.

L1 regularization is an example of training that
explicitly leads to sparsity. In contrast, an example of
sparsity resulting from data is training the classification
layer, where the expected data is sparse.
Figure 2. Computation of one hash for one vector. We first
permute the vector that is to be hashed. Then we compare the
first $N_e$ elements. The index of the highest element is
recorded; the concatenation of $N_s$ indexes yields one hash
value.


When training a layer to be sparse, the non-linear
function following the multiplication has no parameters to
learn and the sparsity constraint is fed back to the weights
involved in the multiplication. How the weights achieve
sparsity depends on the type of non-linearity layer used.

Usually the non-linear function has an output of zero or 
near zero for values that are less than a given threshold. To 
achieve sparsity in these layers, the weights would learn to 
arrive at outputs that are mostly less than the threshold value 
and only occasionally higher than the threshold. Therefore, 
to be able to correctly compute the output of the non-linear 
layer, we hypothesize that only the elements that are larger 
than the threshold need to be identified and computed. 

The same concept can be applied to the backpropagation
of the error term. As the gradient of the non-linearity is
extremely small and possibly zero below the threshold, only
the error values coming from the activated units need to be
computed. In this paper we propose an efficient learning
algorithm that ignores the computations related to the
backpropagation of the remaining units.

The most computationally expensive part of activation
propagation in a layer of units is the multiplication of
the weights of the layer by the inputs of that layer and the
summation of the products. Similarly, the most
computationally expensive operations during backpropagation
are the multiplication of the output error by the weights
to obtain the input error and the multiplication of the
output error by the input values to obtain the weight
errors. The next subsections describe how our approach
speeds up these computationally expensive parts of forward
and backpropagation by identifying units that will have
large values and only computing the values and errors of
these units.

Note that in this work we are concerned with propagating
batches of inputs through the layers. The reasons for this are
the following:

1. The training algorithm used for the neural network is
mini-batch stochastic gradient descent, which is
quite popular in the current literature. In this approach,
at each iteration, the gradient of the loss of the network
is computed over a batch of the training samples and a
step is taken in the opposite direction of the gradient.

2. Processing a batch of inputs favors computational
economy. Although the order of the runtime complexity
remains the same, there can be significant constant-factor
gains depending on the architecture used for the
computation. This is more pronounced for parallel
architectures such as GPUs.

3. The hashing approach is most efficient when batches 
of inputs are processed since, as with other hash based 
algorithms, a rehash is required every time the weights 
in the layer change. This will be elaborated on when 
discussing complexity in section 4. 

3.1. Forward propagation 

In a normal neural layer, propagation of one batch of M
samples through a layer with N output units requires the
computation of the value of the output units $y_{ji}$ for each
pair of output unit $i \in \{1, \ldots, N\}$ and sample
$j \in \{1, \ldots, M\}$.

This is done by summing the product of the K weights of
the ith unit by the K input unit values for the jth sample,
as shown in

$y_{ji} = \sum_{k=1}^{K} w_{ik} x_{jk}$, (1)

where $w_{ik}$ is the network parameter corresponding to the
link connecting the kth input unit with the ith output, and
$x_{jk}$ is the input value of the kth input unit for the jth
sample.

The formulation given in equation 1 for computing the
forward pass can be succinctly expressed as a matrix
multiplication, one form of which is

$Y = X W^{T}$, (2)

where Y, W, and X represent the aggregation of the
outputs, the layer parameters and the inputs in matrix
format.

In our approach, we propose to calculate only a small
portion of the matrix multiplication in equation 2. We are
interested in finding the elements in each row of the output
that have the highest value. These elements correspond to
columns in the matrix $W^{T}$ that result in a high value when
multiplied by a specific row in the X matrix.

Figure 3. Forward propagation: calculating the output of a layer using the WTA hashing.

To do this we first use WTA hashing to identify units
(columns in $W^{T}$) that will be active for each row of matrix
X. Figure 3 depicts an overview of our proposed approach.
We represent activated units by $a_{js}$, which stores the
index of the sth activated unit for the jth sample.

After the activated units are found, only the values
corresponding to these units are computed for each sample.
That is, for a given sample j, corresponding to a row in
matrix X, that row is only multiplied by the columns
$a_{js}$, $s \in \{1, \ldots, A\}$, in $W^{T}$. The effectiveness
of the method is highlighted by contrasting this procedure
against the full matrix multiplication where each row in X
is multiplied by all columns in $W^{T}$.
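
A minimal sketch of this partial multiplication is given below. It is not the optimized implementation used in our experiments. The function name `sparse_forward`, the list-of-lists layout and the `default` argument are assumptions made for illustration, and the active unit indices are taken as given (here they are written by hand and 0-based).

```python
def sparse_forward(X, W, active, default=0.0):
    """Evaluate the layer output only at the selected units.

    X: M x K inputs, W: N x K weights (one row per output unit),
    active[j]: indices of the units selected for sample j.
    Every other output keeps the predefined `default` value.
    """
    n_units = len(W)
    Y = []
    for j, x_row in enumerate(X):
        y_row = [default] * n_units
        for i in active[j]:
            # Dot product of sample j with the weights of unit i only.
            y_row[i] = sum(w * x for w, x in zip(W[i], x_row))
        Y.append(y_row)
    return Y


X = [[1.0, 2.0], [0.5, -1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
active = [[2], [0, 2]]  # hand-picked for the illustration
Y = sparse_forward(X, W, active)
print(Y)
```

Each sample costs one length-K dot product per selected unit instead of one per output unit.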

3.2. Selection of active units 

Figure 4 outlines the process of computing the indexes
of the active units $a_{js}$ for the jth sample. To do this,
first the WTA hashes of the units are computed. All the
hashes are computed in a similar fashion, but according to
different permutation arrays. Consequently, we obtain Q
different hash values for each input vector. In the figure,
$h_{iq}$ represents the hash value of the weights of the
ith unit as given by the qth hash.


After computing the hash values for all the weights of
the units, a multi hash table is constructed for each of the
Q hashes. In each multi hash table q the id of the ith unit
is assigned to bin $h_{iq}$. Therefore each hash partitions
the set of units according to their corresponding hash
values under that permutation. This binning process is
repeated for each permutation array so as to arrive at a set
of Q hash tables.

After binning, the same permutation arrays are used to
compute the hashes of the input samples, leading to the
values $h_{jq}$. These hash values are then used to look up
the bins in their corresponding hash table. The contents of
the retrieved bins constitute the votes for possible active
units. For each input sample these votes are counted and the
top voted units are selected as the active units. In figure 4,
$a_{ju}$ is the index of the uth activated unit for the jth
input sample. A total of A units are chosen per input.
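
The binning and voting steps can be sketched as follows. This is a simplified illustration under our own assumptions, not the optimized implementation: the helper names (`wta_code`, `make_hashes`, `build_tables`, `select_active`) are hypothetical, indices are 0-based, each hash value is kept as a tuple of winning positions rather than a packed bit string, and the layer sizes are toy values. The query is a copy of the weights of unit 3, so it falls into the same bin as that unit in every table.

```python
import random
from collections import Counter, defaultdict


def wta_code(vec, sections, window):
    """Winning position among the first `window` permuted elements of each section."""
    code = []
    for perm in sections:
        selected = [vec[p] for p in perm[:window]]
        code.append(selected.index(max(selected)))
    return tuple(code)


def make_hashes(dim, n_hashes, n_sections, rng):
    """Q hashes, each made of `n_sections` random permutations of range(dim)."""
    return [[rng.sample(range(dim), dim) for _ in range(n_sections)]
            for _ in range(n_hashes)]


def build_tables(W, hashes, window):
    """One table per hash, mapping a hash value (bin) to the ids of the units in it."""
    tables = []
    for sections in hashes:
        table = defaultdict(list)
        for unit_id, weights in enumerate(W):
            table[wta_code(weights, sections, window)].append(unit_id)
        tables.append(table)
    return tables


def select_active(x, tables, hashes, window, n_active):
    """Look up the bin of x in every table, count votes, keep the top-voted units."""
    votes = Counter()
    for table, sections in zip(tables, hashes):
        votes.update(table.get(wta_code(x, sections, window), []))
    return [unit for unit, _ in votes.most_common(n_active)]


rng = random.Random(0)
K, N, Q = 16, 20, 8  # toy sizes chosen for the illustration
W = [[rng.random() for _ in range(K)] for _ in range(N)]
hashes = make_hashes(K, Q, n_sections=2, rng=rng)
tables = build_tables(W, hashes, window=4)
x = list(W[3])  # a query identical to the weights of unit 3
active = select_active(x, tables, hashes, window=4, n_active=4)
print(active)
```

The tables only need to be rebuilt when the weights change, which is why the rehash is performed once per batch rather than once per sample.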

3.3. Backward propagation 

When backpropagating the error terms, we are faced
with the two matrix multiplications depicted in figure 5.
One of these multiplications is performed to estimate the
partial derivatives of the loss with respect to the weights
for each output unit i and each input unit k. The other
multiplication estimates the partial derivatives of the loss
with respect to the inputs for each input sample j


























































Figure 4. Finding the activated unit. The steps shown are: look up the hash table, count votes, and select active units, yielding the index of the active units per sample.
Figure 5. Regular backpropagation.


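and each input unit k:

$\frac{\partial L}{\partial w_{ik}} = \sum_{j=1}^{M} \frac{\partial L}{\partial y_{ji}} \, x_{jk}$, (3)

$\frac{\partial L}{\partial x_{jk}} = \sum_{i=1}^{N} \frac{\partial L}{\partial y_{ji}} \, w_{ik}$, (4)

where $\partial L / \partial y_{ji}$ is the derivative of the loss with respect to the values of the output units.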

Again, computing these two sums of multiplications can
be computationally costly given the size of the neural
layers. To overcome this we only use the elements of the
output error whose corresponding units were activated or
have a positive partial derivative, as shown in figure 6.
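
The restricted backward pass can be sketched in the same style. The name `sparse_backward` and the data layout are our own assumptions, and the list `active[j]` is assumed to already contain every unit kept for sample j, that is, both the activated units and those with a positive partial derivative.

```python
def sparse_backward(X, W, dY, active):
    """Accumulate weight and input gradients over the selected units only.

    X: M x K inputs, W: N x K weights, dY: M x N output errors,
    active[j]: indices of the units kept for sample j.
    """
    M, K, N = len(X), len(X[0]), len(W)
    dW = [[0.0] * K for _ in range(N)]
    dX = [[0.0] * K for _ in range(M)]
    for j in range(M):
        for i in active[j]:
            g = dY[j][i]
            for k in range(K):
                dW[i][k] += g * X[j][k]  # contribution to the weight error
                dX[j][k] += g * W[i][k]  # contribution to the input error
    return dW, dX


X = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dY = [[0.5, 0.0, -1.0]]
active = [[0, 2]]  # unit 1 is skipped entirely
dW, dX = sparse_backward(X, W, dY, active)
print(dW, dX)
```

Rows of `dW` that belong to units never selected in the batch stay at zero, so the corresponding weights need no update.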

4. Computational Complexity 

Each propagation through a fully-connected layer
consists of three multiplications with the sizes M, K, and N,
where M is the number of inputs in one batch of samples,
K is the size of the input layer and N is the size of
the output layer. With this the computational complexity of
passing a single batch of samples through a normal neural
layer becomes:

$T_{normal}(M, K, N) = O(MKN)$. (5)

Figure 6. The process of choosing the units that lead to the highest results.

In a hashing layer the computational complexity of
multiplying the inputs by their selected units is $O(MKA)$,
where A is the number of units selected for activation for
each sample. In addition to this, for each single sample, we
need to compute the hashes and also match the hashes. This
has a computational complexity of $O(MQV)$, with Q as the
total number of hashes used and V as the average number
of votes that a single hash bin assigns.

In this approach only the computational complexity of
placing the output units into the hash table depends on the
number of output units. Because we need to hash all units
to appropriate bins for all different hashes, the complexity
of this operation is $O(NQ)$. With this, the total
computational complexity of training on one batch of inputs
using the hashed layer becomes:

$T_{hashed}(M, K, N) = O(MKA + MQV + NQ)$. (6)

One significant result of this, compared with the time
complexity of a normal layer in equation 5, is that the
variable N has been decoupled from the variables K and M.
The decoupling from K means that if we increase the size
of the input and output units proportionately, the
complexity will only increase linearly, whereas originally
it would increase quadratically.

Furthermore, the decoupling between M and N means that
increasing the size of the batches of samples, M, in
proportion to the number of output units, N, results in an
amortized cost of $O(Q)$ per sample for the term NQ over
all the samples processed.
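
To give a feeling for the relative size of the terms in equations 5 and 6, the snippet below simply evaluates them with all constant factors dropped. The values K = 4096, N = 10184 and Q = 256 match the layer used in our experiments, while the batch size M, the number of active units A and the votes per bin V are hypothetical values picked only for illustration. Since constants and memory effects are ignored, the resulting ratio is not a prediction of measured speed-up.

```python
def normal_cost(M, K, N):
    """Operation count of the dense layer, following equation 5."""
    return M * K * N


def hashed_cost(M, K, N, A, Q, V):
    """Operation count of the hashed layer, following equation 6."""
    multiply = M * K * A  # exact outputs for the selected units
    lookup = M * Q * V    # hashing the samples and counting votes
    binning = N * Q       # hashing every unit into the Q tables
    return multiply + lookup + binning


M, K, N = 128, 4096, 10184  # M is an assumed batch size
A, Q, V = 32, 256, 10       # A and V are assumed values
dense = normal_cost(M, K, N)
hashed = hashed_cost(M, K, N, A, Q, V)
print(dense, hashed, dense / hashed)
```

Note that N enters the hashed count only through the binning term, which is shared by all M samples of the batch.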

The speed-ups obtained in this work are related to both
of these effects because, in our experiments, the size of
the next-to-last layer is proportional to the size of the
last layer. With the explanations given in this section we
expect that scaling the size of the last layer even more
would lead to the second factor becoming dominant in the
amount of savings.

5. Experiments and results 

The configuration that is commonly used for
classification networks is to have a final layer that has
the same number of output units as classes. Therefore this
layer becomes cumbersome when the number of classes
increases and the hashing layer becomes a suitable
substitute for this layer.

To validate our proposal, we used a pretrained network
to extract the features up to the final layer. The experiment
shows that our WTA hashing layer leads to faster training
times with competitive accuracy in comparison to the
exhaustive matrix multiplication in this last classification
layer.

To arrive at these results, we trained the network from [5]
on the ILSVRC2012 dataset [9]. We used this trained
network to extract features from ImageNet10K [2]. To do
this the test set and the training set were fed through the
network and we extracted the FC7 features (4096 features).
These features were then used to train a new single output
layer of units that would classify the input into the 10184
output categories according to a softmax approach.

The single output layer was trained using two
approaches: one was the proposed hashing algorithm and the
other the regular matrix multiplication technique. For
the proposed hashing approach we used 8 elements, $N_e$, per
section and 3 sections, $N_s$, per hash. Furthermore 256
hashes, Q, were computed for hashing a given unit.

To perform the test the traditional multiplication
technique was used for both networks so that the test
approach remains the same.

We tested the network on a separate testing set and
recorded the top-1 accuracy ratio. The results, as a
function of training time, are shown in figure 7.

From the figure we can see that both approaches behave
similarly but our approach converges about 7 times faster.
In table 1 we show a comparison of the computation time
of the forward and the backward propagation for the
traditional multiplication approach and our proposed hashed
technique.

To assess the behavior of the hashed layer during
forward propagation, one final experiment was performed. We
first extracted the FC7 features from the ILSVRC2012 test
set [9] using the same network as in the first experiment.
We then copied the weights of the output layer (FC8) from
the already trained network to a network that only consists
of one hashing layer followed by a softmax classifier.

The new network was used to classify 50000 samples
of the ILSVRC2012 testing dataset using the extracted
features. We repeated this several times while changing the
main parameters of the hashing layer. The accuracy and
computation time of each execution were recorded and
are shown in figure 8 and figure 9 respectively.

Figure 7. Comparing the time required to train a single layer of
a classification network on ImageNet10K [2]. The x axis
represents training time and the y axis represents the accuracy
of the network.

Table 1. Time required to perform one weight update for one layer
of units. Our approach is compared to the normal multiplication
approach. The same parameters as in the first experiment in the
paper are used.

As expected, increasing the number of hashes and the
number of active units both result in increased accuracy at
the expense of more required time. But increasing the number
of active units incurs a larger penalty because its impact
is proportional to the number of input units.

On the other hand, the computation of more hashes is
relatively cheap, and increasing the number of hashes in
favor of reducing the number of activated units results in a
speedup without compromising performance.

Figure 8. The accuracy of the hashed layer on the ILSVRC2012
dataset [9], with respect to the number of activated
units, A, and number of hashes used, Q.

Figure 9. The forward propagation time required for the hashing
layer on the ILSVRC2012 dataset [9], with respect to the number
of activated units, A, and number of hashes used, Q.

Notice that using 1024 hashes to select 32 active units
(which correspond to 3% of the units) requires much less
time while achieving 99.9% of the maximum performance.

6. Discussion and conclusions


In this paper we introduced a hashing approach to speed
up the learning of the parameters of the fully-connected
layers of a neural network. The proposal significantly
reduces the computational complexity of both forward and
backward propagation. We also introduced a particular CPU
implementation of the algorithm that speeds up the
computations by a factor of 7.

We will provide access to the CPU implementation used
in this paper. We should note that the provided package
is not a straightforward implementation of the algorithm
given in this article. To compete with modern matrix
multiplication packages and realize the potential of
modern computation machines we needed to attend to memory
coalescing and to the use of special multi-data and cache
management instructions.

Our ultimate goal is applying the technique to the whole
network. Therefore, an important piece of future work is the
extension of this approach to convolutional layers in CNNs.
With this we anticipate achieving the same type of speedup
while training the whole network.

In line with our main goal, we are interested in testing
the effect of the proposed approach on all the layers that
are fully connected during training. That is, we would like
to speed up layers 6 and 7, as well as the classification
layer.

Furthermore, our immediate endeavor also includes the 
completion of our GPU-CUDA implementation. This 
would allow us to train larger networks, much faster than 
the current state of the art. 